Existing approaches for vision-and-language navigation (VLN) are mainly based on cross-modal reasoning over discrete views. However, this scheme may hamper an agent's spatial and numerical reasoning because objects are often incomplete within a single view and observations are duplicated across views. A potential solution is to map discrete views into a unified bird's-eye view, which can aggregate partial and duplicate observations. Existing metric maps could achieve this goal, but they suffer from less expressive semantics (e.g., usually predefined labels) and limited map size, which weakens an agent's language grounding and long-term planning ability. Inspired by the robotics community, we introduce hybrid topo-metric maps into VLN, where a topological map is used for long-term planning and a metric map for short-term reasoning. Beyond mapping with more expressive deep features, we further design a pre-training framework via the hybrid map to learn language-informed map representations, which enhances cross-modal grounding and facilitates the final language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based route for VLN, and the proposed method sets a new state of the art on three VLN benchmarks.
Translated by Google Translate
This article proposes a model-based deep reinforcement learning (DRL) method to design emergency control strategies for short-term voltage stability problems in power systems. Recent advances show promising results for model-free DRL-based methods in power systems, but model-free methods suffer from poor sample efficiency and long training times, both of which are critical for making state-of-the-art DRL algorithms practically applicable. A DRL agent learns an optimal policy by trial and error while interacting with the real-world environment, and it is desirable to minimize the direct interaction of the DRL agent with the real-world power grid because of its safety-critical nature. Additionally, state-of-the-art DRL-based policies are mostly trained using a physics-based grid simulator, where dynamic simulation is computationally intensive and lowers the training efficiency. We propose a novel model-based DRL framework in which a deep neural network (DNN)-based dynamic surrogate model, instead of a real-world power grid or physics-based simulation, is used within the policy learning framework, making the process faster and more sample-efficient. However, stabilizing model-based DRL is challenging because of the complex system dynamics of large-scale power systems. We address these issues by incorporating imitation learning to give policy learning a warm start, together with reward shaping and a multi-step surrogate loss. Finally, we achieved 97.5% sample efficiency and 87.7% training efficiency for an application to the IEEE 300-bus test system.
Generating consistent and high-quality images from given texts is essential for visual-language understanding. Although impressive results have been achieved in generating high-quality images, text-image consistency is still a major concern in existing GAN-based methods. In particular, the most popular metric, $R$-precision, may not accurately reflect text-image consistency, often resulting in very misleading semantics in the generated images. Despite its significance, how to design a better text-image consistency metric surprisingly remains under-explored in the community. In this paper, we take a further step and develop a novel CLIP-based metric termed Semantic Similarity Distance ($SSD$), which is both theoretically founded from a distributional viewpoint and empirically verified on benchmark datasets. Building on the proposed metric, we further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which aim to improve text-image consistency by fusing semantic information at different granularities and capturing accurate semantics. Equipped with two novel plug-and-play components, the Hard-Negative Sentence Constructor and Semantic Projection, the proposed PDF-GAN can mitigate inconsistent semantics and bridge the text-image semantic gap. A series of experiments show that, compared with current state-of-the-art methods, our PDF-GAN achieves significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.
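Although the conditional variational autoencoder (CVAE) model can generate more diverse responses than the traditional Seq2Seq model, the responses often have low relevance to the input words or are illogical with respect to the question. We carry out a causal analysis to study the reasons behind this, and provide a method for finding the mediators and mitigating the confounding bias in dialogues. Specifically, we propose to predict the mediators in order to preserve relevant information, and to automatically incorporate the mediators into the generation process. In addition, a dynamic topic graph guided conditional variational autoencoder (TGG-CVAE) model is used to complement the semantic space and reduce the confounding bias in responses. Extensive experiments show that the proposed model is able to generate relevant and informative responses, and that it outperforms the state of the art in terms of automatic metrics and human evaluations.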
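The reed relay is a fundamental component of functional testing and is closely related to the successful quality inspection of electronics. To provide accurate remaining useful life (RUL) estimation for reed relays, a hybrid deep learning network with degradation pattern clustering is proposed based on the following three considerations. First, multiple degradation behaviors are observed for reed relays, so a dynamic-based $K$-means clustering is provided to distinguish the degradation patterns from each other. Second, although proper feature selection is of great significance, few studies are available to guide the selection; the proposed method recommends operational rules for ease of implementation. Third, a neural network for remaining useful life estimation (RULNet) is proposed to address the weakness of the convolutional neural network (CNN) in capturing the temporal information of sequential data; it incorporates temporal correlation ability after the high-level feature representation of the convolutional operation. In this way, three variants of RULNet are constructed from health indicators, features with a self-organizing map, or features with curve fitting. Finally, the proposed hybrid model is compared with typical baseline models, including CNN and the long short-term memory network (LSTM), on a practical reed relay dataset with two distinct degradation manners. The results for both degradation cases show that the proposed method outperforms CNN and LSTM in terms of root mean squared error.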
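This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. The challenge includes two tracks. Track 1 targets the super-resolution of compressed images, and Track 2 targets the super-resolution of compressed video. In Track 1, we use the popular DIV2K dataset as the training, validation, and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 365 videos, comprising the LDV 2.0 dataset (335 videos) and 30 additional videos. In this challenge, 12 teams and 2 teams submitted final results to Track 1 and Track 2, respectively. The proposed methods and solutions gauge the state of the art of super-resolution on compressed images and video. The proposed LDV 3.0 dataset is available at https://github.com/renyang-home/ldv_dataset. The homepage of this challenge is at https://github.com/renyang-home/aim22_compresssr.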
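Marketing campaigns are a set of strategic activities that can promote a business's goals. In real industrial scenarios, predicting the effect of marketing campaigns is very complex and challenging, because prior knowledge is usually learned from observational data without any marketing campaign intervention. Furthermore, each subject is always under the interference of several marketing campaigns simultaneously. Therefore, we cannot easily parse and evaluate the effect of a single marketing campaign. To the best of our knowledge, there is currently no effective method for this kind of problem, i.e., modeling an individual-level prediction task based on a hierarchical structure with multiple intertwined events. In this paper, we provide an in-depth analysis of the underlying parse-tree-like structure involved in the effect prediction task, and further establish a Hierarchical Capsule Prediction Network (HAPNET) to predict the effects of marketing campaigns. Extensive results on both synthetic and real data demonstrate the superiority of our model over state-of-the-art methods and show remarkable practicability in real industrial applications.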
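Lyric-to-melody generation is an important task in songwriting, and it is also quite challenging because of its unique characteristics: the generated melodies should not only follow good musical patterns, but also align with features of the lyrics such as rhythm and structure. These characteristics cannot be handled well by neural generation models that learn the lyric-to-melody mapping end to end, because of several issues: (1) a lack of aligned lyric-melody training data from which to sufficiently learn lyric-melody feature alignment; (2) a lack of controllability in generation to explicitly guarantee lyric-melody feature alignment. In this paper, we propose ROC, a new paradigm for lyric-to-melody generation that addresses the above issues through a generation-retrieval pipeline. Specifically, our paradigm has two stages: (1) a creation stage, in which a large number of music pieces are generated by a neural melody language model and indexed in a database by several key features (e.g., chords, tonality, rhythm, and structural information such as chorus or verse); (2) a re-creation stage, in which melodies are re-created by retrieving music pieces from the database according to the key features of the lyrics and combining the retrieved pieces based on composition guidelines and melody language model scores. Our ROC paradigm has several advantages: (1) it needs only unpaired melody data to train the melody language model, rather than the paired lyric-melody data required by previous models; (2) it achieves good lyric-melody feature alignment in lyric-to-melody generation. Experiments on English and Chinese datasets show that ROC outperforms previous neural lyric-to-melody generation models on both objective and subjective metrics.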
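Sparse general matrix multiplication (SpGEMM) is a fundamental building block in many scientific applications. A critical task in SpGEMM is to compute or predict the structure of the output matrix (i.e., the number of nonzero elements in each output row) for efficient memory allocation and load balancing, which affect the overall performance of SpGEMM. Existing work either computes the output structure exactly or adopts upper-bound or sampling-based methods to predict it. However, these methods either take too much execution time or are not accurate enough. In this paper, we propose a novel sampling-based method that offers better accuracy at low cost compared with the existing sampling-based method. The method first predicts the compression ratio of SpGEMM by using the number of intermediate products (denoted FLOP) and the number of nonzero elements (denoted NNZ) of the same sampled result matrix. The predicted output structure is then obtained by dividing the FLOP of each output row by the predicted compression ratio. We also propose a reference design of the existing sampling-based method with optimized computing overhead, in order to demonstrate the better accuracy of the proposed method. We construct 625 test cases with various matrix dimensions and sparse structures to evaluate the prediction accuracy. Experimental results show that the absolute relative errors of the proposed method and the reference design are 1.56% and 8.12%, respectively, on average, and 25% and 156%, respectively, in the worst case.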
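The two-step prediction described in this abstract (estimate a compression ratio FLOP/NNZ from a sample, then divide each row's FLOP by it) can be illustrated with a minimal Python sketch. The abstract does not say how the sample is drawn, so uniform row sampling of the left matrix is an assumption here, as are the CSR-style list inputs and all function names (`row_flops`, `row_nnz_exact`, `predict_output_structure`), which are hypothetical and not taken from the paper.

```python
import random


def row_flops(a_indptr, a_indices, b_indptr):
    """FLOP per output row of C = A * B: the number of intermediate products.

    A and B are given in CSR form (indptr / indices lists); values are not
    needed because only the sparsity structure matters here.
    """
    flops = []
    for i in range(len(a_indptr) - 1):
        total = 0
        for k in a_indices[a_indptr[i]:a_indptr[i + 1]]:
            total += b_indptr[k + 1] - b_indptr[k]  # nnz of row k of B
        flops.append(total)
    return flops


def row_nnz_exact(i, a_indptr, a_indices, b_indptr, b_indices):
    """Exact number of nonzeros in output row i (symbolic multiplication)."""
    cols = set()
    for k in a_indices[a_indptr[i]:a_indptr[i + 1]]:
        cols.update(b_indices[b_indptr[k]:b_indptr[k + 1]])
    return len(cols)


def predict_output_structure(a_indptr, a_indices, b_indptr, b_indices,
                             sample_size, seed=0):
    """Predict nnz per output row as FLOP(row) / compression_ratio.

    Assumption: the sample is a uniform random subset of rows of A; the
    compression ratio is FLOP / NNZ measured on that same sample.
    """
    flops = row_flops(a_indptr, a_indices, b_indptr)
    n_rows = len(flops)
    rng = random.Random(seed)
    sampled = rng.sample(range(n_rows), min(sample_size, n_rows))

    sample_flop = sum(flops[i] for i in sampled)
    sample_nnz = sum(
        row_nnz_exact(i, a_indptr, a_indices, b_indptr, b_indices)
        for i in sampled
    )
    # An empty sample result means no information; fall back to ratio 1.
    ratio = sample_flop / sample_nnz if sample_nnz > 0 else 1.0
    return [f / ratio for f in flops], ratio
```

Calling `predict_output_structure` with the CSR index arrays of both operands returns the predicted nonzero count of every output row together with the estimated compression ratio. Only the sampled rows pay for the exact symbolic step, while the per-row FLOP counts need just one pass over the nonzeros of A.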
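Recent multi-view multimedia applications struggle to balance a high-resolution (HR) visual experience against storage or bandwidth constraints. This paper therefore proposes a multi-view image super-resolution (MVISR) task, which aims to increase the resolution of multi-view images captured from the same scene. One solution is to apply image or video super-resolution (SR) methods to reconstruct HR results from the low-resolution (LR) input view. However, these methods cannot handle large-angle transformations between views or exploit the information in all multi-view images. To address these problems, we propose MVSRNET, which uses geometry information to extract sharp details from all LR multi-view images to support the SR of the LR input view. Specifically, the proposed geometry-aware reference synthesis module in MVSRNET uses geometry information and all multi-view LR images to synthesize pixel-aligned HR reference images. The proposed dynamic high-frequency search network then fully exploits the high-frequency textural details in the reference images for SR. Extensive experiments on several benchmarks show that our method significantly improves over state-of-the-art approaches.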